Data Visualisation and Communications

CA1: Data Journalism With Jupyter Notebook

(image: data_journalism, retrieved from Wikipedia)

For this assignment, the tools used to explore the data and extract logical progressions of information will be the ggplot2 and pandas libraries, for R and Python respectively. They will be utilised together using Jupyter Notebook, which allows data to be effortlessly passed between the separate interpreter kernels.

The ggplot2 package is a plotting library for R, created by Hadley Wickham. It sits alongside his other 'Tidyverse' packages to present a uniform syntax for manipulating, plotting and analysing data. The ggplot2 library itself is a programmatic implementation of Leland Wilkinson's 'Grammar of Graphics'. Wilkinson has also held senior positions at both Tableau and H2O.ai.

The pandas library is a package for Python which facilitates the efficient handling of large scale data, through the use of its 'DataFrame' object and many associated methods.
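To illustrate the DataFrame idiom relied on throughout this notebook, here is a toy sketch (values hypothetical) of the vectorised column arithmetic pandas makes possible:

```python
import pandas as pd

# Toy frame in the shape of the lobbying data (values hypothetical).
df = pd.DataFrame({"Compensation": [1078.0, 13500.0],
                   "Expenses": [190.0, 0.0]})

# Column arithmetic is vectorised: no explicit loop is needed.
df["Total"] = df["Compensation"] + df["Expenses"]
print(df["Total"].tolist())  # → [1268.0, 13500.0]
```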

Firstly, a few environmental preparations & loading required libraries:

In [1]:
%cd /mnt/Data/Jupyter/VisCA1
/mnt/Data/Jupyter/VisCA1
In [2]:
import pandas as pd
In [3]:
import warnings
warnings.filterwarnings("ignore")
%load_ext rpy2.ipython
In [4]:
%%R
require(tidyverse)
require(ggplot2)
require(plotluck)
require(reshape2)
require(cowplot)
theme_set(theme_minimal())

Exercise 1

Using Entity Type, create a tree map showing the sum of total-lobbying by type of government.

• What did you find out?

• What other questions would you want to answer?

• What calculations could you create that would inform your reporting?

First, the data must be read in.

In [5]:
lobbying = pd.read_excel("lobbying_data.xlsx")
lobbying.tail()
Out[5]:
Entity-cleaned Entity Compensation Expenses entity_type fips City County State
126 Grays Harbor Public Utility District Grays Harbor Public Utility District 1078.0 190.0 UTILITY DISTRICTS NaN NaN NaN WA
127 Lewis Public Utility District Lewis Public Utility District 13500.0 0.0 UTILITY DISTRICTS NaN NaN NaN WA
128 Okanogan Public Utility District Okanogan Public Utility District 9300.0 0.0 UTILITY DISTRICTS NaN NaN NaN WA
129 Public Utility District #1 Of Cowlitz County Public Utility District #1 Of Cowlitz County 12000.0 0.0 UTILITY DISTRICTS NaN NaN NaN WA
130 Snohomish Public Utility District Snohomish Public Utility District 15000.0 0.0 UTILITY DISTRICTS NaN NaN NaN WA

Next, the monetary columns are aggregated to provide an overview of lobbying spending (a simple and quick calculation to perform with pandas), unnecessary columns are dropped and the column headers tidied up.

In [6]:
lobbying["Total-Lobbying"] = lobbying["Compensation"] + lobbying["Expenses"]
lobbyingTree = lobbying[["entity_type", "Entity-cleaned", "Total-Lobbying"]]
lobbyingTree.rename(columns={"entity_type":"Type", "Entity-cleaned":"Entity", "Total-Lobbying":"Lobbying"}, inplace=True)
lobbyingTree.head()
Out[6]:
Type Entity Lobbying
0 CITIES Algona 18000.0
1 CITIES Arlington 13687.5
2 CITIES Auburn 1900.0
3 CITIES Battle Ground 12000.0
4 CITIES Bellevue 19000.0

The data is now passed to R and entered into a 'data.tree' structure. This type of data structure is well optimised for holding hierarchical, entity-related data of this kind.

In [7]:
%%R -i lobbyingTree
require(data.tree)

lobbyingTree$pathString <- paste('LOBBYING_TREE', lobbyingTree$Type, lobbyingTree$Entity, sep = "/")
tree <- as.Node(lobbyingTree[,])
print(tree, pruneMethod='dist', limit=20)
                                                               levelName
1  LOBBYING_TREE                                                        
2   ¦--CITIES                                                           
3   ¦   ¦--Algona                                                       
4   ¦   ¦--Arlington                                                    
5   ¦   °--... 48 nodes w/ 0 sub                                        
6   ¦--COUNTIES                                                         
7   ¦   ¦--Asotin County                                                
8   ¦   ¦--Clark County                                                 
9   ¦   °--... 16 nodes w/ 0 sub                                        
10  ¦--OTHER                                                            
11  ¦   ¦--Association Of Washington Cities                             
12  ¦   ¦--Association Of Washington State Public Facilities Districts  
13  ¦   °--... 18 nodes w/ 0 sub                                        
14  ¦--PORTS                                                            
15  ¦   ¦--Port Of Bellingham                                           
16  ¦   ¦--Port Of Bremerton                                            
17  ¦   °--... 9 nodes w/ 0 sub                                         
18  ¦--PUBLIC FACILITIES DISTRICTS                                      
19  ¦   ¦--Spokane Public Facilities District                           
20  ¦   °--Washington State Convention Center Public Facilities District
21  ¦--SCHOOL DISTRICTS                                                 
22  ¦   ¦--Seattle Public Schools                                       
23  ¦   °--... 3 nodes w/ 0 sub                                         
24  ¦--TRIBES                                                           
25  ¦   °--... 17 nodes w/ 0 sub                                        
26  °--UTILITY DISTRICTS                                                
27      °--... 9 nodes w/ 0 sub                                         

Next, the total lobbying values are entered and aggregated through the different levels of the tree.

In [8]:
%%R
tree$Do(function(x) x$aggLobbying <- Aggregate(x, 'Lobbying', sum))
tree$Sort(attribute='Lobbying', decreasing=TRUE, recursive=TRUE)
print(tree, 'Lobbying', 'aggLobbying', pruneMethod='dist', limit=14)
                                      levelName  Lobbying aggLobbying
1  LOBBYING_TREE                                       NA  2523089.92
2   ¦--CITIES                                          NA   870602.31
3   ¦   ¦--Seattle                              138092.00   138092.00
4   ¦   ¦--Tacoma                                74100.00    74100.00
5   ¦   °--... 48 nodes w/ 0 sub                       NA          NA
6   ¦--COUNTIES                                        NA   321889.00
7   ¦   ¦--King County                           96654.00    96654.00
8   ¦   ¦--Pierce County                         67802.00    67802.00
9   ¦   °--... 16 nodes w/ 0 sub                       NA          NA
10  ¦--OTHER                                           NA   408145.84
11  ¦   ¦--Washington Indian Gaming Association  79319.35    79319.35
12  ¦   °--... 19 nodes w/ 0 sub                       NA          NA
13  ¦--PORTS                                           NA   218413.61
14  ¦   °--... 11 nodes w/ 0 sub                       NA          NA
15  ¦--PUBLIC FACILITIES DISTRICTS                     NA    24073.45
16  ¦   °--... 2 nodes w/ 0 sub                        NA          NA
17  ¦--SCHOOL DISTRICTS                                NA    60500.00
18  ¦   °--... 4 nodes w/ 0 sub                        NA          NA
19  ¦--TRIBES                                          NA   490778.71
20  ¦   °--... 17 nodes w/ 0 sub                       NA          NA
21  °--UTILITY DISTRICTS                               NA   128687.00
22      °--... 9 nodes w/ 0 sub                        NA          NA

The data is now in the correct format to be plotted as a treemap. The 'treemap' library is used to create the plot, which is displayed below.

In [9]:
%%R -w 8 -h 5 --units in -r 150
require(treemap)

treemap(ToDataFrameNetwork(tree, 'Lobbying', direction='climb')[9:139,], 
        index=c('from', 'to'), vSize='Lobbying', vColor='Lobbying', type='value', palette=heat.colors(-55))

An alternative way to view this data (which conveys the same information) is as a graph. Below, the igraph library is used to present the data in a different visual format.

In [10]:
%%R -w 12 -h 12 --units in -r 100
require(igraph)

treeGraph <- as.igraph(tree, 'aggLobbying', directed=F, direction='climb')
vertex_attr(treeGraph, 'aggLobbying')[1] <- 0

plot(treeGraph, vertex.color='lightblue', vertex.label.family='Segoe UI', 
     vertex.label.cex=0.55, vertex.label.color='black', layout=layout.kamada.kawai,
     vertex.size=vertex_attr(treeGraph, 'aggLobbying') / (max(vertex_attr(treeGraph, 'aggLobbying')) * 0.035))

These visualisations show that a large proportion of the lobbying money comes from the cities entity type. Further, it is clear that by far the largest monetary amount has come from Seattle. While this is interesting, it would be more revealing to look at how these payments were distributed geographically. The next question will explore this objective.

Exercise 2

• Build out your map of the city lobbying data. Write a brief memo (approximately ½ page) of key findings and describe what steps you would take to report this out for a story.

In order to explore this data geographically, it is necessary to introduce geographical (longitude/latitude) columns into the dataframe. In this case, this data will be joined onto the current dataset from a publicly available dataset containing American cities, states, counties, zipcodes and long/lat columns.

The .csv file is read into pandas and filtered down to remove all non-Washington addresses. Next, all rows containing a duplicate of the 'city' field are removed (to provide a clean, singular location per city). The city is then set as the dataframe index in preparation for being joined onto the lobbying dataframe.

In [11]:
longlat = pd.read_csv("zip_codes_states.csv")
longlat = longlat[longlat["state"] == "WA"]
longlat.drop_duplicates(subset="city", inplace=True)
longlat.set_index("city", inplace=True)
longlat.head()
Out[11]:
zip_code latitude longitude state county
city
Auburn 98001 47.465495 -121.821908 WA King
Federal Way 98003 47.432251 -121.803388 WA King
Bellevue 98004 47.615471 -122.207221 WA King
Black Diamond 98010 47.310568 -121.998721 WA King
Bothell 98011 47.750689 -122.214376 WA King

The dataframes are now joined and the column names cleaned up. To ensure the dataset is clean, a query is performed to return all rows containing NaN values.

In [12]:
lobbyingGeo = lobbying[lobbying["entity_type"] == "CITIES"].join(longlat, how="left", on="City")[
    ["Total-Lobbying", "City", "county", "longitude", "latitude"]]
lobbyingGeo.rename(columns={"Total-Lobbying":"lobbying", "City":"city"}, inplace=True)
lobbyingGeo.drop(lobbyingGeo.index[44], inplace=True) # duplicate row
lobbyingGeo[lobbyingGeo.isnull().any(1)]
Out[12]:
lobbying city county longitude latitude
0 18000.00 Algona NaN NaN NaN
6 18000.00 Burien NaN NaN NaN
8 7166.68 Covington NaN NaN NaN
12 12061.47 Fife NaN NaN NaN
38 7000.00 Seatac NaN NaN NaN
41 25900.00 Shoreline NaN NaN NaN

Since there are only a few rows which have been missed, the data for these location coordinates has been extracted from 'http://citylatitudelongitude.com/' and added into the dataframe manually.

In [13]:
lobbyingGeo.loc[0, "county"], lobbyingGeo.loc[0, "latitude"], lobbyingGeo.loc[0, "longitude"] = "King", 47.281954, -122.250388
lobbyingGeo.loc[6, "county"], lobbyingGeo.loc[6, "latitude"], lobbyingGeo.loc[6, "longitude"] = "King", 47.46917, -122.364291
lobbyingGeo.loc[8, "county"], lobbyingGeo.loc[8, "latitude"], lobbyingGeo.loc[8, "longitude"] = "King", 47.364791, -122.104563
lobbyingGeo.loc[12, "county"], lobbyingGeo.loc[12, "latitude"], lobbyingGeo.loc[12, "longitude"] = "Pierce", 47.23206, -122.351726
lobbyingGeo.loc[38, "county"], lobbyingGeo.loc[38, "latitude"], lobbyingGeo.loc[38, "longitude"] = "King", 47.44333, -122.298767
lobbyingGeo.loc[41, "county"], lobbyingGeo.loc[41, "latitude"], lobbyingGeo.loc[41, "longitude"] = "King", 47.756904, -122.342414

lobbyingGeo.head()
Out[13]:
lobbying city county longitude latitude
0 18000.0 Algona King -122.250388 47.281954
1 13687.5 Arlington Snohomish -121.959469 48.181498
2 1900.0 Auburn King -121.821908 47.465495
3 12000.0 Battle Ground Clark -122.531645 45.803592
4 19000.0 Bellevue King -122.207221 47.615471

A further check is carried out to discover whether there are any duplicate rows.

In [14]:
duplicates = len(lobbyingGeo.drop_duplicates(subset=["longitude", "latitude"])) != len(lobbyingGeo)
print("Any Duplicate coordinates present?:", duplicates)
Any Duplicate coordinates present?: True

Since duplicate rows have been discovered, these rows have been queried (to return the correct index of these records) and then manually re-entered.

In [15]:
lobbyingGeo[lobbyingGeo["latitude"] == 
            float(lobbyingGeo[~lobbyingGeo.isin(lobbyingGeo.drop_duplicates(
                subset=["longitude", "latitude"]))].dropna()["latitude"])]
Out[15]:
lobbying city county longitude latitude
11 22490.0 Federal Way King -121.803388 47.432251
39 138092.0 Seattle King -121.803388 47.432251
In [16]:
lobbyingGeo.loc[11, "latitude"], lobbyingGeo.loc[11, "longitude"] = 47.322323, -122.312622
lobbyingGeo.loc[39, "latitude"], lobbyingGeo.loc[39, "longitude"] = 47.608013, -122.335167

When re-running the duplicates check, the following result is confirmed.

In [17]:
print("Any Duplicate coordinates present?:", len(lobbyingGeo.drop_duplicates(subset=["longitude", "latitude"])) != len(lobbyingGeo))
Any Duplicate coordinates present?: False
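As an aside, pandas can surface every member of a duplicate group in a single expression via `duplicated(keep=False)`, a simpler alternative to the query used above; a minimal sketch on hypothetical coordinates:

```python
import pandas as pd

# Hypothetical coordinate table with one repeated (longitude, latitude)
# pair, mirroring the duplicate found for Federal Way and Seattle.
geo = pd.DataFrame({
    "city": ["Federal Way", "Seattle", "Tacoma"],
    "longitude": [-121.803388, -121.803388, -122.444291],
    "latitude": [47.432251, 47.432251, 47.252877],
})

# duplicated(keep=False) flags *every* member of a duplicate group,
# so both offending rows come back in one expression.
dupes = geo[geo.duplicated(subset=["longitude", "latitude"], keep=False)]
print(dupes["city"].tolist())  # → ['Federal Way', 'Seattle']
```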

The data is now in a suitable state for visualising. It is passed into R and plotted. Two visualisations are shown below. On the left is a density plot giving a broad overview, highlighting the areas of Washington State with a significant concentration of data points. This chart reveals that the vast majority of the payments come from the highly urbanised area around Seattle. On the right, the location and scale of the payments from the Seattle area are shown on an urban roadmap.

The maps are pulled from Google Maps and are then overlaid with visual information using the ggplot2 library. The minimum and maximum longitude and latitude values are used to automatically calculate central points for the visualisations.

The 'plot_grid()' function of the 'cowplot' package is used to plot the visualisations side by side. This library is used throughout the rest of the document in order to arrange ggplot objects in multiple-plot formats.

In [18]:
%%R -i lobbyingGeo -w 7 -h 4 --units in -r 180
require(ggmap)

lat <- c(min(lobbyingGeo$latitude), max(lobbyingGeo$latitude))
lon <- c(min(lobbyingGeo$longitude), max(lobbyingGeo$longitude))

plot <- get_map(location=c(mean(lon), mean(lat)), zoom=7, source='google', maptype='terrain')
state <- ggmap(plot, extent='device') + 
  geom_density2d(data=lobbyingGeo, aes(x=longitude, y=latitude), size=0.1, color='black') + 
  stat_density2d(data=lobbyingGeo, 
                 aes(x=longitude, y=latitude, fill=..level.., alpha=..level..), size=0.01, 
                 bins=16, geom='polygon') + 
  scale_fill_gradient(low='green', high='red', guide=FALSE) + 
  scale_alpha(range=c(0, 0.3), guide=FALSE)

plot <- get_map(location=c(median(lobbyingGeo$longitude), median(lobbyingGeo$latitude)), 
                zoom=9, source='google', maptype='roadmap')
cities <- ggmap(plot, extent='device') + 
  geom_point(aes(x=longitude, y=latitude), data=lobbyingGeo, col="darkred", alpha=0.4, 
              size=lobbyingGeo$lobbying/max(lobbyingGeo$lobbying)*15) +
  scale_alpha(range = c(0, 0.3), guide = FALSE)

plot_grid(state, cities, ncol=2, labels=NULL)

On visual observation, it is clear that the vast majority of lobbying money is coming from this highly urbanised area. The previous treemap and graph plot highlighted that a large sum of money was coming from Seattle itself, but that did not show the true scope of the concentration of payments.

Questions are raised, such as whether the lobbying data is being documented accurately. Is it necessary for these highly localised payments to be so distributed, when they lie over such a small area? Further, this chart does not consider lobbying payments made by private organisations, businesses and societies. These organisations are also highly likely to be centred inside the city area, and in some cases are quite possibly connected to the politicians and individuals associated with the payments made through public bodies.

Exercise 3

• Build a dashboard to highlight interesting information extracted from either of the datasets below:

UNHCR Syrian Refugees or Refinery Accidents in the United States

• In under ¾ of a page explain the message that you wish to communicate through the dashboard. Briefly describe the visualisation techniques used to design the dashboard.

The dataset chosen for this exercise is the 'Refinery Accidents in the United States'.

Firstly, the data is read into pandas. The final row has been omitted, as it was a 'Totals' record; if left in, it would have heavily skewed the proportionate distribution of the data compared with the other records.

In [19]:
refinery = pd.read_excel("19. Data Set - Module 6 - accidents-state.xlsx")[:-1]
refinery.tail()
Out[19]:
State # of RMP facilities # of accidents # of deaths # of injuries # evacuated Property damage (dollars)
48 Virginia 135 13 0 7 846 267905
49 Washington 268 15 7 11 264 21204494
50 West Virginia 77 16 0 18 38 30025100
51 Wisconsin 263 25 1 12 6262 2831200
52 Wyoming 62 12 0 10 32 73598500

To map this data, shapefile data is needed. The 'maps' package in R contains many useful datasets of this type. The shape data for states is loaded into the R kernel and exported for use by pandas in Python.

In [20]:
%%R -o all_states
require(maps)

all_states <- map_data("state")
head(all_states)
       long      lat group order  region subregion
1 -87.46201 30.38968     1     1 alabama      <NA>
2 -87.48493 30.37249     1     2 alabama      <NA>
3 -87.52503 30.37249     1     3 alabama      <NA>
4 -87.53076 30.33239     1     4 alabama      <NA>
5 -87.57087 30.32665     1     5 alabama      <NA>
6 -87.58806 30.32665     1     6 alabama      <NA>

Next, the two datasets are to be joined. Before joining, an exploratory query is performed to assess whether the data is complete enough to be successfully combined. To allow the join to accurately match the state names, the names in the original dataset are first converted to lower case. A right join is then performed (so that no rows from the original dataset are lost) and a query is run to display any NaN values present.

In [21]:
refinery["State"] = refinery["State"].str.lower()
nulls = all_states.join(refinery.set_index("State"), how="right", on="region")
nulls[nulls.isnull().any(1) == True]

# Note: Guam, Hawaii and Alaska show low values, so they will be ignored.
# Puerto Rico and the Virgin Islands show very high values; they will be
# ignored here but are worth exploring further out of interest.
# All rows are left in the frame to facilitate any potential calculations.
Out[21]:
long lat group order region subregion # of RMP facilities # of accidents # of deaths # of injuries # evacuated Property damage (dollars)
15599 NaN NaN NaN NaN alaska NaN 43 7 0 1 20 103
15599 NaN NaN NaN NaN guam NaN 4 0 0 0 0 0
15599 NaN NaN NaN NaN hawaii NaN 16 4 0 1 600 500
15599 NaN NaN NaN NaN puerto rico NaN 66 3 0 71 2000 500000
15599 NaN NaN NaN NaN virgin islands of the u.s. NaN 1 4 0 16 1300 8540000

All the mainland states were successfully joined; the remaining NaN values correspond to non-contiguous states and territories. Most of these do not show particularly interesting data, however the row for the Virgin Islands is quite astonishing: while there is only one facility, there have been numerous accidents and a large amount of collateral damage. Since the primary interest here is the mainland states, these rows will be ignored.
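A complementary way to audit such a join is `DataFrame.merge` with `indicator=True`, which records each row's provenance in a `_merge` column; a toy sketch with hypothetical stand-in frames:

```python
import pandas as pd

# Toy stand-ins for the shape data and the refinery table (values hypothetical).
shapes = pd.DataFrame({"region": ["alabama", "arizona"], "long": [-87.5, -111.9]})
stats = pd.DataFrame({"State": ["alabama", "alaska"], "accidents": [31, 7]})

# indicator=True adds a '_merge' column recording each row's provenance,
# which makes unmatched states easy to isolate after the join.
audit = shapes.merge(stats, left_on="region", right_on="State",
                     how="right", indicator=True)
unmatched = audit[audit["_merge"] == "right_only"]
print(unmatched["State"].tolist())  # → ['alaska']
```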

The two datasets are now joined, and the columns renamed with more tidy (and functionally useful) titles.

In [22]:
refinery.rename(columns={"# of RMP facilities":"facility_count", "# of accidents":"accidents", 
                            "# of deaths":"deaths", "# of injuries":"injuries", 
                            "# evacuated":"evacuated", "Property damage (dollars)":"property_damage"}, inplace=True)
refineryGeo = all_states.join(refinery.set_index("State"), how="right", on="region")
refineryGeo.drop(columns="subregion", inplace=True)
refineryGeo.head()
Out[22]:
long lat group order region facility_count accidents deaths injuries evacuated property_damage
1 -87.462006 30.389681 1.0 1.0 alabama 196 31 0 23 59 182682
2 -87.484932 30.372492 1.0 2.0 alabama 196 31 0 23 59 182682
3 -87.525032 30.372492 1.0 3.0 alabama 196 31 0 23 59 182682
4 -87.530762 30.332386 1.0 4.0 alabama 196 31 0 23 59 182682
5 -87.570869 30.326654 1.0 5.0 alabama 196 31 0 23 59 182682

Next, the data in the newly joined frame is plotted. The shapefile can be filled with colour gradients in order to visualise the scale of each column of interest across America. This time, a black and white map is loaded from the 'Stamen' maps API (through ggmap) and the shapefile plotted over it. The plot_grid() function is used to combine the plots. Colours have been assigned to differentiate between general metrics, collateral damage and human cost.

In [23]:
%%R -i refineryGeo -w 7 -h 3 --units in -r 180

lat <- c(min(all_states$lat), max(all_states$lat))
lon <- c(min(all_states$lon), max(all_states$lon))

plot <- get_map(location=c(mean(lon), mean(lat)), zoom=4, source='stamen', maptype='toner')

facility_count <- ggmap(plot, extent='device') +
  scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
  scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) + 
  geom_polygon(data=refineryGeo, aes(x=long, y=lat, group=group, fill=facility_count), colour='black', size=0.1, alpha=0.4) +
  scale_fill_continuous(low='thistle2', high='darkblue', guide=FALSE)

damage <- ggmap(plot, extent='device') +
  scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
  scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) + 
  geom_polygon(data=refineryGeo, aes(x=long, y=lat, group=group, fill=property_damage), colour='black', size=0.1, alpha=0.4) +
  scale_fill_continuous(low='thistle2', high='chocolate', guide=FALSE)

injuries <- ggmap(plot, extent='device') +
  scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
  scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) + 
  geom_polygon(data=refineryGeo, aes(x=long, y=lat, group=group, fill=injuries), colour='black', size=0.1, alpha=0.4) +
  scale_fill_continuous(low='thistle2', high='darkred', guide=FALSE)

accidents <- ggmap(plot, extent='device') +
  scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
  scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) + 
  geom_polygon(data=refineryGeo, aes(x=long, y=lat, group=group, fill=accidents), colour='black', size=0.1, alpha=0.4) +
  scale_fill_continuous(low='thistle2', high='darkblue', guide=FALSE)

evacuated <- ggmap(plot, extent='device') +
  scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
  scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) + 
  geom_polygon(data=refineryGeo, aes(x=long, y=lat, group=group, fill=evacuated), colour='black', size=0.1, alpha=0.4) +
  scale_fill_continuous(low='thistle2', high='chocolate', guide=FALSE)

deaths <- ggmap(plot, extent='device') +
  scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
  scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) + 
  geom_polygon(data=refineryGeo, aes(x=long, y=lat, group=group, fill=deaths), colour='black', size=0.1, alpha=0.4) +
  scale_fill_continuous(low='thistle2', high='darkred', guide=FALSE)

plot_grid(facility_count, damage, deaths, accidents, evacuated, injuries, nrow=2, ncol=3, 
          labels=c('Facilities', 'Damage', 'Deaths', 'Accidents', 'Evacuated', 'Injuries'))

While this display is visually appealing and seems informative, it is flawed. Each state has a differing number of facilities, meaning that larger states and states with the highest facility counts will naturally show much higher values across all categories related to accident fallout. This is why almost every metric simply displays Texas as the state with the highest value.

In order to correct for this, each fallout-related column is divided by the total number of facilities in that state. This gives a measure of accidents per facility and should provide a much more level and useful metric of the comparative safety levels in each state. The adjusted columns are added to the dataframe, and a 'MinMaxScaler' from the SciKit-Learn machine learning library is applied in order to scale all values to between 0 and 1.

In [24]:
refinery = refinery.set_index("State").drop(index=[region for region in nulls[nulls.isnull().any(1) == True].region]).reset_index()
# Per-facility rates, computed with vectorised column division.
refinery["scaled_accidents"] = refinery["accidents"] / refinery["facility_count"]
refinery["scaled_damage"] = refinery["property_damage"] / refinery["facility_count"]
refinery["scaled_evacuated"] = refinery["evacuated"] / refinery["facility_count"]
refinery["scaled_deaths"] = refinery["deaths"] / refinery["facility_count"]
refinery["scaled_injuries"] = refinery["injuries"] / refinery["facility_count"]
refinery

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 

refinery_scaled = pd.concat([refinery.iloc[:,[0, 1, 2, 3, 4, 5, 6]], 
                                pd.DataFrame(scaler.fit_transform(refinery.loc[:,[
                      "scaled_accidents", "scaled_damage", "scaled_evacuated", "scaled_deaths", "scaled_injuries"]]), 
             columns=["scaled_accidents", "scaled_damage", "scaled_evacuated", "scaled_deaths", "scaled_injuries"])], axis=1)

refineryGeo_scaled = all_states.join(refinery_scaled.set_index("State"), how="inner", on="region")
refineryGeo_scaled.drop(columns="subregion", inplace=True)
refineryGeo_scaled.head()
Out[24]:
long lat group order region facility_count accidents deaths injuries evacuated property_damage scaled_accidents scaled_damage scaled_evacuated scaled_deaths scaled_injuries
1 -87.462006 30.389681 1.0 1 alabama 196 31 0 23 59 182682 0.158163 0.000785 0.003989 0.0 0.007228
2 -87.484932 30.372492 1.0 2 alabama 196 31 0 23 59 182682 0.158163 0.000785 0.003989 0.0 0.007228
3 -87.525032 30.372492 1.0 3 alabama 196 31 0 23 59 182682 0.158163 0.000785 0.003989 0.0 0.007228
4 -87.530762 30.332386 1.0 4 alabama 196 31 0 23 59 182682 0.158163 0.000785 0.003989 0.0 0.007228
5 -87.570869 30.326654 1.0 5 alabama 196 31 0 23 59 182682 0.158163 0.000785 0.003989 0.0 0.007228

With the adjusted and scaled dataset prepared, the data can be plotted again.

In [25]:
%%R -i refineryGeo_scaled -w 7 -h 3 --units in -r 180

lat <- c(min(all_states$lat), max(all_states$lat))
lon <- c(min(all_states$lon), max(all_states$lon))

plot <- get_map(location=c(mean(lon), mean(lat)), zoom=4, source='stamen', maptype='toner')

facility_count <- ggmap(plot, extent='device') +
  scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
  scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) + 
  geom_polygon(data=refineryGeo_scaled, aes(x=long, y=lat, group=group, fill=facility_count), colour='black', size=0.1, alpha=0.4) +
  scale_fill_continuous(low='thistle2', high='darkblue', guide=FALSE)

damage <- ggmap(plot, extent='device') +
  scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
  scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) + 
  geom_polygon(data=refineryGeo_scaled, aes(x=long, y=lat, group=group, fill=scaled_damage), colour='black', size=0.1, alpha=0.4) +
  lims(fill=c(0, 1)) +
  scale_fill_continuous(low='thistle2', high='chocolate') +
  theme(legend.position='none')

injuries <- ggmap(plot, extent='device') +
  scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
  scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) + 
  geom_polygon(data=refineryGeo_scaled, aes(x=long, y=lat, group=group, fill=scaled_injuries), colour='black', size=0.1, alpha=0.4) +
  lims(fill=c(0, 1)) +
  scale_fill_continuous(low='thistle2', high='darkred') +
  theme(legend.position='none')

accidents <- ggmap(plot, extent='device') +
  scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
  scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) + 
  geom_polygon(data=refineryGeo_scaled, aes(x=long, y=lat, group=group, fill=scaled_accidents), colour='black', size=0.1, alpha=0.4) +
  lims(fill=c(0, 1)) +
  scale_fill_continuous(low='thistle2', high='darkblue') +
  theme(legend.position='none')

evacuated <- ggmap(plot, extent='device') +
  scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
  scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) + 
  geom_polygon(data=refineryGeo_scaled, aes(x=long, y=lat, group=group, fill=scaled_evacuated), colour='black', size=0.1, alpha=0.4) +
  lims(fill=c(0, 1)) +
  scale_fill_continuous(low='thistle2', high='chocolate') +
  theme(legend.position='none')

deaths <- ggmap(plot, extent='device') +
  scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
  scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) + 
  geom_polygon(data=refineryGeo_scaled, aes(x=long, y=lat, group=group, fill=scaled_deaths), colour='black', size=0.1, alpha=0.4) +
  lims(fill=c(0, 1)) +
  scale_fill_continuous(low='thistle2', high='darkred') +
  theme(legend.position='none')

plot_grid(facility_count, damage, deaths, accidents, evacuated, injuries, nrow=2, ncol=3, 
          labels=c('Facilities', 'Damage', 'Deaths', 'Accidents', 'Evacuated', 'Injuries'))

This time, a much more informative result is delivered. While the 'Facilities' plot is identical, directly representing the number of facilities in each state, the other visualisations are all scaled according to that facility count. The 'Accidents' overview shows a relatively even spread of accidents when weighted by the number of facilities, so the differences in collateral damage and human cost can be attributed to the nature of the accidents and the response to them, rather than to the probability of an accident occurring. For example, while California shows a high proportion of Injuries and Evacuation, it has very few deaths. Conversely, while Washington shows low levels of Injury, Damage and Evacuation, its Death proportion is high. There are also anomalous exceptions such as Wyoming, which shows an especially high Damage level. Since no specifically higher rate of accidents per facility is observed across the states, these differences could be due to the nature of the refineries themselves and their safety procedures, or to how the emergency responses were handled in the cases where accidents did occur.

Next, this information will be presented in a form which can be assimilated more easily, quickly and intuitively. It was mentioned before that the columns can be divided between collateral damage and human cost. The related columns will be aggregated and presented under those two categories of interest.

Additionally, a larger map will be presented to display a generalised overall 'risk factor' associated with refineries in that state. To facilitate this, the dimensionality of the data is reduced down using the PCA processing module from the SciKit-Learn library. Before applying the PCA transformation, the data is normalised to a standard normal distribution using the 'StandardScaler' module, also from SciKit-Learn. Once aggregated, the data is plotted on a map.

Finally, because it has now been established that accidents are distributed roughly evenly across refineries, the two refinery metric charts are of less interest and are omitted.
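Conceptually, the standardise-then-PCA step described above can be reproduced with NumPy alone. The following sketch uses invented numbers (not the refinery data) to show what StandardScaler and PCA are doing under the hood:

```python
import numpy as np

# Toy data: 5 observations of 3 correlated metrics (stand-ins for the
# scaled accident columns; the values are illustrative only).
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 3.5, 1.0],
              [3.0, 5.5, 1.2],
              [4.0, 7.0, 2.1],
              [5.0, 9.0, 2.4]])

# Standardise each column to zero mean and unit variance
# (what StandardScaler does with its default settings).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD of the standardised matrix: the principal components
# are the projections of Z onto the right singular vectors.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
components = Z @ Vt.T  # analogous to PCA(n_components=3).fit_transform(Z)

print(components.shape)                         # (5, 3)
print(np.allclose(components.mean(axis=0), 0))  # True: components are centred
```

Because Z is centred, the component scores are centred too, which is why summing them (as done below) yields a score distributed around zero.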

In [26]:
from sklearn.preprocessing import StandardScaler
x = refinery_scaled.loc[:, ["scaled_accidents", "scaled_damage", "scaled_evacuated", "scaled_deaths", "scaled_injuries"]].values

y = refinery_scaled.loc[:,["State"]].values

x = StandardScaler().fit_transform(x)

from sklearn.decomposition import PCA
pca = PCA(n_components=3)

principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(principalComponents)

# Sum the three components row-wise into a single overall score
principalDf["Overall"] = principalDf.sum(axis=1)

refinery_PCA = pd.concat([refinery_scaled[["State"]], principalDf["Overall"]], axis=1)

refineryGeo_PCA = all_states.join(refinery_PCA.set_index("State"), how="inner", on="region")
refineryGeo_PCA.drop(columns="subregion", inplace=True)
refineryGeo_PCA.head()
Out[26]:
long lat group order region Overall
1 -87.462006 30.389681 1.0 1 alabama -1.281757
2 -87.484932 30.372492 1.0 2 alabama -1.281757
3 -87.525032 30.372492 1.0 3 alabama -1.281757
4 -87.530762 30.332386 1.0 4 alabama -1.281757
5 -87.570869 30.326654 1.0 5 alabama -1.281757

The plot is created using the familiar methods from other recent map plots. It is displayed below.

In [27]:
%%R -i refineryGeo_PCA -w 8 -h 3 --units in -r 150

lat <- c(min(all_states$lat), max(all_states$lat))
lon <- c(min(all_states$lon), max(all_states$lon))

plot <- get_map(location=c(mean(lon), mean(lat)), zoom=4, source='stamen', maptype='toner')

collateral <- ggmap(plot, extent='device') +
  scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
  scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) + 
  geom_polygon(data=refineryGeo_scaled, aes(x=long, y=lat, group=group, fill=(scaled_damage+scaled_evacuated)/2), colour='black', size=0.1, alpha=0.4) +
  lims(fill=c(0, 1)) +
  scale_fill_continuous(low='thistle2', high='chocolate') +
  theme(plot.title=element_text(size=8), legend.position='none') +
  ggtitle('Collateral Damage')   

human_cost <- ggmap(plot, extent='device') +
  scale_x_continuous(limits=c(min(lon), max(lon)), expand = c(0, 0)) +
  scale_y_continuous(limits=c(min(lat), max(lat)), expand = c(0, 0)) + 
  geom_polygon(data=refineryGeo_scaled, aes(x=long, y=lat, group=group, fill=(scaled_injuries+scaled_deaths)/2), colour='black', size=0.1, alpha=0.4) +
  lims(fill=c(0, 1)) +
  scale_fill_continuous(low='thistle2', high='darkred') +
  theme(plot.title=element_text(size=8), legend.position='none') +
  ggtitle('Human Cost')

plot <- get_map(location=c(mean(lon), mean(lat)), zoom=4, source='google', maptype='terrain')

overall_risk <- ggmap(plot, extent='device') +
  scale_x_continuous(limits=c(min(lon), max(lon)), expand=c(0, 0)) +
  scale_y_continuous(limits=c(min(lat), max(lat)), expand=c(0, 0)) + 
  geom_polygon(data=refineryGeo_PCA, aes(x=long, y=lat, group=group, fill=Overall), colour='black', size=0.0, alpha=0.4) +
  lims(fill=c(min(refineryGeo_PCA$Overall), max(refineryGeo_PCA$Overall))) +
  scale_fill_continuous(low='palegreen3', high='darkred') +
  theme(plot.title=element_text(size=8), legend.position='none') +
  ggtitle('Generalised Risk Level') +
  scale_alpha(range=c(0, 0.3), guide=FALSE)

ggdraw() +
  draw_plot(collateral, width=0.45, height=0.45, x=0, y=0.5) +
  draw_plot(human_cost, width=0.45, height=0.45, x=0, y=0) +
  draw_plot(overall_risk, width=0.95, height=0.95, x=0.2, y=0)

Exercise 4

• Build a dashboard to show deficient bridges and explore the unsafe bridges

1. What other questions might you want to answer?

2. What’s the rating of the Skagit River bridge that collapsed into the river? What other distinguishing information can we obtain that might go into a story?

3. When was the bridge last inspected and how often is it inspected?

4. Is it considered fracture-critical, meaning one hit to a span can knock it out?

5. How much average daily traffic does it get?

6. Explain how you got the answers.

The dataset to be used here is very messy, so some initial work is required to bring it into a usable form. Firstly, it is read into pandas; because it is part of a multi-sheet Excel file, the sheet number is also specified.

In [28]:
bridgeData = pd.read_excel("Data Set - Module 7 - wa-15.xlsx", sheet_name=0)
bridgeData.tail()
Out[28]:
st_abbv state strucnum rectype nicarid rtsigprf deslev rtnum dirsuff hwyagncy ... avtraff avtrafno scour futdaily futdlyno futyear stat suffrate suffrtno stat2
18787 WA 530 60505000003173 1 5300605050000029696 6 0 58000.0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18788 WA 530 61404000001026 1 5300614040000010240 6 0 47000.0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18789 WA 530 61404000001028 1 5300614040000010240 6 0 47130.0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18790 WA 530 61703000001455 1 5300617030000010240 6 0 54000.0 0 5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18791 WA 530 61708000001524 1 5300617080000010240 6 0 19000.0 0 5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 102 columns

A data dictionary is also supplied with this file. While the columns of the primary DataFrame do not correlate directly with the rows of the dictionary, it does provide valuable insight into the data available and how to access and interpret it. It is read in in the same manner as the primary DataFrame and will be referred to on an ad hoc basis when additional information on the data is required.

In [29]:
bridgeDict = pd.read_excel("Data Module 7 record layout - reclay_1_1.xls", sheet_name=0)
bridgeDict[90:].head()
Out[29]:
FIELD_NAME FIELD_TYPE FIELD_LEN FIELD_DEC DESCRIPTION ITEM# PAGE# (based on the mtguide.pdf, as in Page XX of 223)
90 CHANNEL C 1.0 0.0 Channel/Channel Protection 61 Page 92 of 223 (also see errata.pdf)
91 CULVERT C 1.0 0.0 Culverts 62 Page 94 of 223 (also see errata.pdf)
92 MTHDOPRT C 1.0 0.0 Method Used To Determine Operating Rating 63 Page 96 of 223
93 OPRAT C 3.0 0.0 Operating Rating 64 Page 98 of 223
94 OPRATNO N 5.0 1.0 Numeric conversion for OPRAT 64-NICAR translated Page 98 of 223

This information is used to explore and query the primary DataFrame as follows. The column name or number is not directly related to the primary data, but it is close enough to quickly locate the desired data.

In [30]:
bridgeData.iloc[:, [29, 33]].head()
Out[30]:
year avdayno
0 1921 1920
1 1976 22807
2 2012 50
3 1984 17204
4 1984 9027
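Rather than counting column positions against the data dictionary by hand, a column's integer position can be recovered by name with pandas' `Index.get_loc`; a minimal sketch on a toy frame (the column names mirror the bridge data, the values are invented):

```python
import pandas as pd

# Toy stand-in for bridgeData with a few of its columns.
df = pd.DataFrame({"strucnum": [1, 2],
                   "year": [1921, 1976],
                   "avdayno": [1920, 22807]})

# Recover a column's positional index by name.
print(df.columns.get_loc("year"))     # 1
print(df.columns.get_loc("avdayno"))  # 2
```

This makes `iloc`-based lookups robust to columns being added or reordered upstream.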

The second sheet of the primary spreadsheet contains a small dataset connecting county names with county FIPS codes. This will be needed later, so it is read in and assigned to a variable for future use.

In [31]:
bridgeCounties = pd.read_excel("Data Set - Module 7 - wa-15.xlsx", sheet_name=1)
bridgeCounties["COUNTY"] = bridgeCounties["COUNTY"].str.lower()
bridgeCounties.head()
Out[31]:
STALPHA COUNTY STFIPS CNTYFIPS
0 WA adams 53 1
1 WA asotin 53 3
2 WA benton 53 5
3 WA chelan 53 7
4 WA clallam 53 9

Another shapefile will be needed in order to map and divide the counties of Washington. This will again be imported from the R 'maps' library. The county-level data is imported and filtered down to Washington counties only, then passed out to the Python kernel in preparation for future use.

In [32]:
%%R -o wa_county

counties <- map_data("county")
wa_county <- subset(counties, region == 'washington')
tail(wa_county)
           long      lat group order     region subregion
86731 -119.8742 46.62158  2932 86731 washington    yakima
86732 -119.8857 46.60439  2932 86732 washington    yakima
86733 -119.8742 46.57001  2932 86733 washington    yakima
86734 -119.8800 46.22050  2932 86734 washington    yakima
86735 -119.8628 46.20905  2932 86735 washington    yakima
86736 -119.8685 46.04289  2932 86736 washington    yakima

Now that all the required data is loaded, the dataset that will be used for analysis can be constructed. A new DataFrame is instantiated and the primary columns of interest are assigned and renamed to more useful headers.

In [33]:
bridges = pd.DataFrame()
bridges["lon"] = bridgeData["longmap"]
bridges["lat"] = bridgeData["latmap"]
bridges["county"] = bridgeData["cnty"]
bridges["status"] = bridgeData["stat2"]
bridges["rating"] = bridgeData["suffrtno"]
bridges.head()
Out[33]:
lon lat county status rating
0 -121.883333 45.691667 59 1 31.1
1 -122.300833 47.661111 33 2 66.6
2 0.000000 0.000000 65 0 92.7
3 -122.536978 47.258722 53 2 81.0
4 -119.344944 46.245694 5 0 95.1

Next, the table is filtered down to only those bridges marked as structurally deficient (holding a value of 1 or 2 in the column derived from the original 'stat2') and having a rating of 50 or less (the threshold at which they are considered in urgent need of replacement or repair).

In [34]:
bridgesRep = bridges[((bridges["status"] == 2) | (bridges["status"] == 1)) & (bridges["rating"] <= 50)]

bridgesGeo = bridgeCounties[["COUNTY", "CNTYFIPS"]].join(bridgesRep.set_index("county"), how="inner", on="CNTYFIPS").reset_index(drop=True)
bridgesGeo.rename(columns={"COUNTY":"county"}, inplace=True)
bridgesGeo.drop(columns=["CNTYFIPS", "status"], inplace=True)

bridgesGeo.tail()
Out[34]:
county lon lat rating
385 yakima -120.345000 46.325000 27.8
386 yakima -119.999583 46.227639 14.8
387 yakima -120.283333 46.395000 27.5
388 yakima -120.280000 46.400000 6.0
389 yakima -120.401667 46.483333 39.6
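The filter-and-join pattern used in the cell above can be illustrated on a toy example; the column names mirror the notebook's, but the values are invented. Note that `Series.isin` is a compact alternative to chaining `|` comparisons:

```python
import pandas as pd

# Toy stand-ins for the bridge table and the county lookup.
bridges = pd.DataFrame({
    "county": [1, 3, 3, 5],
    "status": [2, 0, 1, 2],
    "rating": [46.6, 92.7, 27.8, 81.0],
})
counties = pd.DataFrame({"COUNTY": ["adams", "asotin", "benton"],
                         "CNTYFIPS": [1, 3, 5]})

# Boolean mask: deficient status (1 or 2) AND rating <= 50.
deficient = bridges[bridges["status"].isin([1, 2]) & (bridges["rating"] <= 50)]

# Join the lookup's names onto the filtered rows: set the bridge table's
# foreign key as the right-hand index, then join on CNTYFIPS.
merged = counties.join(deficient.set_index("county"), how="inner", on="CNTYFIPS")
print(merged[["COUNTY", "rating"]])  # adams 46.6, asotin 27.8
```

The inner join drops counties with no qualifying bridges, exactly as in the notebook's construction of `bridgesGeo`.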

Viewing the tail of the dataset reveals that there are 390 bridges in this 'high risk' category (the reset index runs from 0 to 389). Initially, a density plot will be created as a broad overview of the distribution of deficient bridges.

In [35]:
%%R -i wa_county -i bridgesGeo -w 5 -h 3 --units in -r 200

lat <- c(min(wa_county$lat), max(wa_county$lat))
lon <- c(min(wa_county$lon), max(wa_county$lon))

plot <- get_map(location=c(mean(lon), mean(lat)), zoom=6, source='google', maptype='hybrid')
ggmap(plot, extent='device') + 
  scale_x_continuous(limits=c(min(lon), max(lon)), expand=c(0, 0)) +
  scale_y_continuous(limits=c(min(lat), max(lat)), expand=c(0, 0)) +
  scale_alpha(range=c(0, 0.3), guide = FALSE) +
  theme_void() +
  labs(title='Concentration of Deficient Bridges') +
  geom_density2d(data=bridgesGeo, aes(x=lon, y=lat), size=0.1, color='white') + 
  stat_density2d(data=bridgesGeo, 
                 aes(x=lon, y=lat, fill=..level.., alpha=..level..), size=0.01, 
                 bins=16, geom='polygon') + 
  scale_fill_gradient(low='green', high='red', guide=FALSE)

It can be observed that the unsafe bridges are distributed widely over the entire state, with emphasis on the major population zones and a particular concentration around the Seattle area. This could be explained by the higher concentration of traffic; however, it would be expected that bridges under higher strain would be maintained more frequently and vigilantly. This line of inquiry can be explored further.

In the following plot, much more detail will be included than in this broad overview visualisation. In order to facilitate this, rather than plotting on top of a map image, the plot will be drawn from the basic shape file and built up manually. This will provide a much cleaner canvas to work with, and more control over what elements are included. This will allow more aspects of the data to be visualised without the image becoming over-complicated.

So far, the Google and Stamen maps APIs have been relied on to provide contextual geographic information. Plotting without the underlying map requires this information to be added manually, which can be achieved through ggplot2's 'geom_text()' function. In order to plot the names, geographically centred coordinates for each county are required. These are acquired by filtering the previously used US zip codes data down to Washington, grouping by county, then calculating the median of each group. This provides a roughly central measure that should allow the names to be placed correctly.

In [36]:
wa_counties = pd.read_csv("zip_codes_states.csv")
wa_counties = wa_counties[wa_counties["state"] == "WA"]
wa_counties = wa_counties.groupby(by="county").median().reset_index()
wa_counties.drop(columns=["zip_code"], inplace=True)
wa_counties.head()
Out[36]:
county latitude longitude
0 Adams 46.852708 -118.690546
1 Asotin 46.174099 -117.178580
2 Benton 46.155994 -119.402276
3 Chelan 47.701787 -120.354082
4 Clallam 48.067730 -124.360572
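The choice of the median over the mean matters here: a single outlying zip code would drag a mean-based centre well away from the county's core. A minimal sketch on invented coordinates:

```python
import pandas as pd

# Toy zip-code latitudes for one county, with one outlying entry
# (e.g. a PO-box zip geocoded far from the county's core).
zips = pd.DataFrame({"county": ["adams"] * 5,
                     "latitude": [46.8, 46.9, 46.85, 46.87, 49.5]})

centroids = zips.groupby("county").agg(mean_lat=("latitude", "mean"),
                                       median_lat=("latitude", "median"))
print(centroids)
# The median (46.87) stays within the cluster; the mean (47.384) is
# dragged toward the outlier - hence the median for label placement.
```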

Finally, the average level of traffic over each bridge per county is calculated.

In [37]:
bridges["traffic"] = pd.to_numeric(bridgeData["avdayno"], errors="coerce")

bridgesTraffic = bridges[((bridges["status"] == 2) | (bridges["status"] == 1)) & (bridges["rating"] <= 50)]

bridgesGeoTraffic = bridgeCounties[["COUNTY", "CNTYFIPS"]].join(bridgesTraffic.set_index("county"), how="inner", on="CNTYFIPS").reset_index(drop=True)
bridgesGeoTraffic.rename(columns={"COUNTY":"county"}, inplace=True)
bridgesGeoTraffic.drop(columns=["CNTYFIPS", "status"], inplace=True)
trafficMap = wa_county.join(bridgesGeoTraffic[["county", "traffic"]].groupby(by="county").mean(), on="subregion")
trafficMap.head()
Out[37]:
long lat group order region subregion traffic
84970 -118.235573 46.736168 2891.0 84970 washington adams 1611.8
84971 -119.370026 46.741898 2891.0 84971 washington adams 1611.8
84972 -119.375748 46.902325 2891.0 84972 washington adams 1611.8
84973 -118.980423 46.908054 2891.0 84973 washington adams 1611.8
84974 -118.980423 47.257561 2891.0 84974 washington adams 1611.8

The visualisation is now plotted. All of the methodology has been used in previous plots; here, however, it is carefully combined to show a number of metrics at once. Bridges meeting our high risk criteria are represented by shaded points, the depth of colour reflecting the severity of the bridge's rating. The shading of each county represents the average daily traffic over high risk bridges in that county. The density lines from the previous plot are also kept, to accentuate the overall distribution of high risk bridges.

In [38]:
%%R -i wa_county -i wa_counties -i bridgesGeo -i trafficMap -w 6 -h 4 --units in -r 200

lat <- c(min(wa_county$lat), max(wa_county$lat))
lon <- c(min(wa_county$lon), max(wa_county$lon))

ggplot() + 
  geom_polygon(data=trafficMap, aes(x=long, y=lat, group=group, fill=traffic), color='black', alpha=0.8, size=0.2) +
  scale_x_continuous(limits=c(min(lon), max(lon)), expand=c(0, 0)) +
  scale_y_continuous(limits=c(min(lat), max(lat)), expand=c(0, 0)) +
  geom_point(data=bridgesGeo, aes(x=lon, y=lat, color=rating), alpha=0.5, size=0.8) +
  scale_colour_gradient(low='thistle2', high='darkred', guide=FALSE) +
  scale_alpha(range=c(0, 0.3), guide=FALSE) +
  theme_void() +
  labs(title='Severity Rating of Deficient Bridges') +
  geom_text(data=wa_counties, aes(x=longitude, y=latitude, label=county), color="black", size=2) +
  geom_density2d(data=bridgesGeo, aes(x=lon, y=lat), size=0.08, alpha=0.6, color='white') +
  scale_fill_continuous(low='forestgreen', high='darkred')

The plot clearly displays that there are high risk bridges distributed across the entire state. As noted from the previous visualisation, the largest concentration of high risk bridges is centred around the highly urbanised Seattle area. The Seattle area also shows much higher average traffic crossing those bridges than other counties (with the exception of one anomalous result, Asotin, which may see heavier cross-state traffic due to its location). This could go some way towards explaining the higher concentration of high risk bridges in the area.

There is also a notably higher level of traffic crossing dangerous bridges in Skagit county. While it is close to the Seattle area, it shows a higher level of crossing traffic than would be expected in comparison to other similarly located counties. A highly publicised bridge collapse occurred in Skagit in 2013. This bridge can be isolated and explored in the data.

It is known that this bridge was built in 1955, so all the Skagit bridges built in 1955 which classify as high risk under our definition can be queried. The date of construction is added to the prepared dataset, along with a few other columns of interest.

In [39]:
bridges["built"] = bridgeData["year"]
bridges["inspmnth"] = bridgeData["inspmnth"]
bridges["inspyear"] = bridgeData["inspyear"]
bridges["inspfreq"] = bridgeData["inspfreq"]

bridgesSkagit1955 = bridgeCounties[["COUNTY", "CNTYFIPS"]].join(bridges.set_index("county"), how="inner", on="CNTYFIPS").reset_index(drop=True)
bridgesSkagit1955.rename(columns={"COUNTY":"county"}, inplace=True)
bridgesSkagit1955.drop(columns=["CNTYFIPS"], inplace=True)

bridgesSkagit1955[((bridgesSkagit1955["status"] == 2) | (bridgesSkagit1955["status"] == 1)) & (bridgesSkagit1955["rating"] <= 50)
                            & (bridgesSkagit1955["built"] == 1955) & (bridgesSkagit1955["county"] == "skagit")]
Out[39]:
county lon lat status rating traffic built inspmnth inspyear inspfreq
12337 skagit -122.341083 48.443278 2 46.6 70948.0 1955 6 15 24.0

This is the bridge which collapsed. Our query returns a single bridge, which fits the description precisely. It can be verified that this is the correct bridge by comparison with the information available here: https://bridgehunter.com/wa/skagit/4794A0000000/.

It is observable that this bridge has a very high level of daily traffic compared to other high risk bridges.

In [40]:
print("Mean daily traffic of all high risk bridges:", bridges[((bridges["status"] == 2) | (bridges["status"] == 1)) & (bridges["rating"] <= 50)]["traffic"].mean())
Mean daily traffic of all high risk bridges: 6743.541025641026
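Setting the collapsed bridge's figure of 70,948 against this mean (both values taken from the outputs above) quantifies just how heavily used it is:

```python
# Values copied from the query results above.
skagit_traffic = 70948.0  # average daily traffic over the collapsed bridge
mean_traffic = 6743.54    # mean across all high risk bridges

ratio = skagit_traffic / mean_traffic
print(round(ratio, 1))  # 10.5 - roughly ten times the average
```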

It can also be observed that this bridge is scheduled for inspection every 24 months, and that the last recorded inspection was in June 2015 (after the repair of the damage suffered in the bridge's collapse).
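Given the values returned by the query above (inspmnth = 6, inspyear = 15 taken to mean June 2015, inspfreq = 24 months), the next inspection would fall due in June 2017. A rough sketch of that arithmetic, where the helper `next_inspection` is hypothetical and not part of the dataset:

```python
def next_inspection(month, year, freq_months):
    """Add freq_months to a (month, year) pair, returning (year, month)."""
    total = year * 12 + (month - 1) + freq_months
    return total // 12, total % 12 + 1

# June 2015 + 24-month inspection cycle.
year, month = next_inspection(6, 2015, 24)
print(f"{month}/{year}")  # 6/2017
```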